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ABSTRACT 

Source separation problems are ubiquitous in the physical 
sciences; any situation where signals are superimposed calls 
for source separation to estimate the original signals. In this 
tutorial I will discuss the Bayesian approach to the source 
separation problem. This approach has a specific advantage 
in that it requires the designer to explicitly describe the signal 
model in addition to any other information or assumptions 
that go into the problem description. This leads naturally to 
the idea of informed source separation, where the algorithm 
design incorporates relevant information about the specific 
problem. This approach promises to enable researchers to 
design their own high-quality algorithms that are specifically 
tailored to the problem at hand. 

1. UNDERSTANDING THE PROBLEM 

To gather information about the physical world, we deploy 
sensors to make measurements and detect signals. Our sen- 
sors, if properly designed, will collect information about the 
signals of interest. However, very often the signals of inter- 
est are comprised of a set of discrete signals, which have 
been superimposed during propagation, often with signals 
that are not of interest. Thus our sensors almost invariably 
detect a mixture of signals — some interesting and some non- 
interesting. In more straightforward applications, careful de- 
sign of the sensors and application of filters can limit the 
recordings to the signal of interest. However, when this is not 
possible, more extreme steps need to be taken. This leads to 
a class of problems called source separation problems. 

There is no limit to the complications that may arise. Su- 
perposition may be linear, or one of the infinite varieties of 
nonlinear superposition. If a set of sensors are used, there 
may be time delays due to propagation of the signals from 
each source to each detector, or there could be convolutions 
due to reflections or differences in propagation speed through 
intervening media coupled with diffraction. To presume to 
be able to construct a single algorithm that can deal with all 
of these imaginable cases is unrealistic. Instead, we must 
focus our efforts on developing a methodology for design- 
ing robust algorithms that are specific to the application at 
hand. Only then can we take advantage of the specific prior 
knowledge we possess about each problem to increase our 
chances of reaching an accurate and optimal solution. I call 
this approach informed source separation, which should be 
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contrasted with blind source separation, where very little is 
assumed to be known about the problem. 

The source separation problem can be viewed as an in- 
ference problem, where one models a set of detected signals 
as a mixture of a set of source signals. It is important to re- 
member that inference is not deduction — it doesn’t always 
work. In difficult problems, prior information goes a long 
way to help assure that we reach an accurate solution. This 
prior information can take many forms, and can come into 
the problem at several different points. I will show that this 
prior information can significantly transform the source sep- 
aration problem, and subsequently, the algorithmic solution. 

2. BAYESIAN PROBABILITY THEORY 

In this section, I give a brief description of the Bayesian 
methodology. I will focus on the use of probability theory 
to describe our knowledge, and leave the details of the prob- 
lem of searching the hypothesis space for the optimal solu- 
tion to other authors. The crux of the methodology is Bayes’ 
Theorem 

, , ,| , ,, , I ..p{data\model,I) 

p{model\data,I) = p(model\l) — — (1) 

p[data\I) 

where I represents our prior information. The probability on 
the left p{model\dat a, I) is called the posterior probability. 
It is the probability that a specific model accurately describes 
the problem given the data and our prior information I. The 
first term on the right p(model\I) is called the prior prob- 
ability, or prior for short. The prior describes the degree 
to which we believe a specific model is the correct descrip- 
tion before we see any data, and thus encodes our knowl- 
edge about the possible values of the model parameters. The 
term in the numerator p(data\model,I) is called the likeli- 
hood, which describes the degree to which we believe that 
the model could have produced the observed data. This term 
encodes both the process of making predictions with our hy- 
pothesized model and the process of comparing these predic- 
tions to our data, which is an important part of the scientific 
method. The term in the denominator p(data\I ) is called the 
evidence. In many parameter estimation problems, where we 
have a static model and are merely estimating the values of its 
parameters, this term simply acts a normalization factor. In 
problems where we are testing one of a set of several mod- 
els, this term becomes extremely relevant as it can indicate 
the degree to which a model is favored by the data. 

The space of all considered models is called the hypothe- 
sis space. Bayes’ Theorem turns the source separation prob- 



lem into a search problem, where we search the hypothe- 
sis space for the most probable model. We can also look at 
Bayes’ Theorem as a learning rule since tells us how to up- 
date our prior knowledge when we receive new data. What 
we have learned from this data combined with what we knew 
prior is described by the posterior probability. In the next 
section I will show how the Bayesian method applied to a 
well-defined set of prior information leads directly to Info- 
max ICA. 

3. FIRST EXAMPLE: INFOMAX ICA 

In this section I will demonstrate how the Bayesian method- 
ology allows one to derive Infomax ICA [1]. While this par- 
ticular derivation has been published previously [2], I present 
it here again in detail both to assist with understanding the 
later derivations in this paper, and also to clear up some com- 
mon misconceptions surrounding ICA and source separation. 
ICA is commonly considered to be a blind source separation 
algorithm because we make a minimal number of assump- 
tions. However, it is important point to note that no algo- 
rithm is truly blind, and that the assumptions we make — even 
if minimal in some sense — will have an affect on the perfor- 
mance of an algorithm when applied to a given problem. 

We begin by assuming that there are N sources whose 
signals propagate instantaneously to N distinct detectors. 
The signals are assumed to superimpose linearly so that each 
detector records a linear mixture of the source signals. Fur- 
thermore, the entire process is assumed to be noise-free. 
This leads to a simple mathematical model that describes the 
recorded signals in terms of the unknown sources 1 

N 

%it z (2) 

j= 1 

where x lt is the signal recorded at i th detector at time t, Sj t is 
the source signal emitted by the j' h source at time f, and Ajj 
is the “mixing matrix”. The elements of the mixing matrix 
serve to couple the sources to the detectors. Physically each 
Aij describes how the signal propagates from the source to 
the detector. While the physical interpretation is kept vague 
in the blind algorithm, we will see that the physical interpre- 
tation is quite useful when deriving informed source separa- 
tion algorithms. 

We can now apply Bayes’ Theorem (1) to express the 
probability of our model, A and s, given our data, x 

p(A,s|x,/)=p(A,s|/) p(x|A ; S ’ /) , (3) 

p( x K) 

where A represents the entire matrix, s represents all of the 
source signals emitted by the sources, and x represents all 
of our recorded data. Given the fact that the source signals 
are independent of the propagation, we can factor the prior 
probability p(A.s|7) into two terms p(A\I) and p(s\I). 

p(A,s|x,7) = p(A|/)p(s|/) p(xlA ’ S ) ’ /) . (4) 

Once we assign these probabilities, the problem is in some 
sense solved. All we will need to do is to search all possi- 
ble values of the matrix A and the source waveshapes s to 

1 1 have purposefully kept this in component form so that it may be more 

easily compared with other algorithms presented later in this tutorial. 


find the case that is most probable. 2 When we perform this 
search, the evidence in the denominator will never contribute 
to the calculation since it doesn’t depend on the model pa- 
rameters. So we can simplify the problem further by writing 
(4) as a proportionality 

p(A,s|x,/) <xp(A|/)p(s|/)p(x|A,s,/). (5) 

Now with the basic model in hand, we can construct a 
likelihood function. In this case, we are assuming that the 
recording process is noise-free; thus we will assign a delta- 
function likelihood function for each datum point 

N 

p(x it \A,s t ,I) = 5(x it - Y, A ‘j s j‘)’ ( 6 ) 

}= i 

where I have used s t to represent all of the N source ampli- 
tudes emitted at time t. This delta function likelihood states, 
very strongly, that we believe that our model of the source 
separation problem (2) is correct. The recordings are inde- 
pendent of one other, so the likelihood function for our entire 
data set is merely the product of likelihoods (6) for each de- 
tector and each time step 


f(x|a,s,/) = nn 5 ( x '' _ (r) 

i=lt=l 7=1 

where T is the number of time steps. 

Next we assume that the probability density of the ampli- 
tude of the individual source signals has a positive kurtosis 
(also known as leptokurtotic or super-Gaussian). With this 
assumption, we can assign a prior probability for the ampli- 
tudes of the signals emitted by the sources. Without such a 
prior, this problem has an infinite number of perfectly good 
solutions. This prior information will serve to make the prob- 
lem soluble in cases where it is correct, while risking the pos- 
sibility of incorrect solutions in cases where this assumption 
does not hold. We will write 


p(sjt\ I )=q j {s j ,), (8) 

where qj(sj t ) is the probability that the j' h source could have 
a given amplitude at any time t. We could easily follow Bell 
& Sejnowski [1] and assign the derivative of a sigmoid func- 
tion, which is a leptokurtotic density function. However, for 
the purposes of generalization, we will just write it as qj. As- 
suming that the sources are independent of one another, we 
have 

rtsi/)=nn®M- (9) 

j=U=l 

We now assume that we know nothing about the mixing 
matrix. We will encode this knowledge by assigning a uni- 
form prior 3 for the value of any given matrix element Ajj as 
long as it is within a “reasonable” range 


p{Aij\I) = 


C if Amin ^ Aij ^ A max 

0 if Aij <C Amiri’) A m ax ^ Aij 


( 10 ) 


2 In practice, conducting this search is often the most difficult part of the 
problem. 

3 The astute reader will recognize that each matrix element acts as a scal- 
ing parameter in the problem. For this reason, a more accurate noninforma- 
tive prior would be the appropriate Jeffrey’s prior for the matrix A. 



where c = (A max — A min ) 1 . For the entire mixing matrix, we 
can assign a uniform joint prior 

p(A|/) = nf[p(Ay|/) (11) 

i=\j=\ 

I C if V Ajj , A m j n A ij A max 
0 if 3Ajj, s.t. Aij < A m i n , A mox > Ay 

where C = dV 2 . With our likelihood and priors all defined, 
we are now ready to re-write the posterior probability (5) and 
begin searching for the most probable parameter values. 

However, this search will not be easy. Much of the effort 
in Bayesian inference is to limit the number of parameters 
to search over, or to come up with a clever heuristic to per- 
form the search. Furthermore, it is often easier to work with 
the logarithm of the posterior (4). This neatly separates the 
posterior into a sum of the log priors plus the log likelihood. 

We will begin by reducing the number of parameters of 
interest, and will conclude by taking the logarithm. First we 
reason that if we knew the mixing matrix A, or better yet, 
its inverse A -1 , we could apply it to the data to recover an 
estimate of the source signals. Surely this is not ideal, but it 
is easier than searching the entire multidimensional parame- 
ter space, which would be N 2 +NT parameters. To do this 
we use the fact that probabilities sum to one, and marginal- 
ize over all possible values of the source signal amplitudes, 
written symbolically as 

P(A|x,7) = fds p(A,s|x,7) (12) 

oc jds p(A\I)p(s\I)p(x.\A,s,I) 

<*p{ A|/) J ds p(s|7)p(x|A,s,7) 

where the integral sign represents all NT integrals over each 
of the sj t . Substituting our likelihood (7), and source density 
prior (9), we have NT integrals to solve 

r r N T N 

j ds 1 1 • • • I dswf - T, A <J s jt)- ( i3 ) 

j j j=u=i j = i 


which when taking the logarithm gives 


logp(A|x,7) = K+ 


N T 

log p( A |7) — log detA + EE lOg?; 

j=U=l 



( 17 ) 


where K is the logarithm of the constant implicit in the pro- 
portionality. By varying A to maximize the log posterior 
above, we can solve for the optimal mixing matrix. The way 
this is done in ICA is to take the derivative with respect to the 
inverse of A and to use this in a gradient ascent learning rule. 
Specifically, if we assign the mixing matrix prior according 
to (11), and write Wjj = Aj\ we get the familiar Infomax 
ICA gradient ascent learning rule [1,2] 


AWjj = x 

,J dW; 


ij L 


N T 


— logdetA+ £ £lo gqA Y, A i/ 


j=U=l 


N 


Xit 


N T 

= Aji + EE x j< 

j=U = 1 


qi(uit) 


j 


(18) 


where u it = Y!i=\A^x it . 

However, strictly speaking, this rule doesn’t lead to the 
optimal separation matrix W, since the maximum value of 
the posterior with respect to variations of A will not be iden- 
tical to the maximum value of the posterior with respect to 
A -1 . This is due to the fact that probability densities trans- 
form differently than functions. Since 


P( A 1 |x,7) =p(A|x,7) 


dA- 


dA 


if we define 


A = arg max p(A|x,7) 

A 

W = arg max p(A|x,7) 
a- 1 

W = arg max p( A _1 |x,7) 
A- 1 


(19) 


(20) 

( 21 ) 

( 22 ) 


in general, we have that 


The delta functions easily allow us to solve each of the in- 
tegrals simply by introducing a change of variables where 
Wit =X{, — L‘/=i AjjSjt. We then have that det A ds = d w, and 
that Sp = jjiLi Ajj l {xit — Wi t ), so that the integral becomes 


1 

detA 


[ ■ ■ ■ j dWNf nn^(E A i j 1 (x i t-Wi,))8(w i ,). 
’ ’ /=! r=l /— 1 


(14) 


The delta functions now all select w, r = 0, and we have as a 
result 


1 

detA 



(15) 


Substituting this result into (12) we get 


W/WandW/A- 1 , (23) 

where W is the optimal separation matrix, A is the optimal 
mixing matrix, and W is the Infomax ICA solution. Thus the 
inverse of the optimal estimate of the mixing matrix does not 
equal the optimal estimate of the separation matrix. 4 How- 
ever, the ICA solution is actually neither of these. If we were 
really interested in finding the most probable inverse of the 
mixing matrix, we should have used (19) to write the poste- 
rior for A -1 and solved for its most probable value (22). As 
a result the standard technique (18) and (21) leads to a biased 
separation. 

That being said, there is much one can learn from this 
derivation of the Infomax ICA algorithm. Certainly one ob- 
tains the same answer when deriving it from the information- 
theoretic viewpoint; this being due to the duality between 


F(A|x,7)°cp(A|7)-_- 



4 The classic example of this is the fact that the frequency at which a 
( 16 ) given blackbody spectrum has maximum energy density is different than the 
wavelength at which it has maximum energy density. 



probability theory and information theory [3]. However, the 
Bayesian derivation has several distinct advantages. First, all 
of the assumptions that go into the algorithm are made ex- 
plicit. We see that the sigmoidal nonlinearity in the original 
derivation is merely related to the derivative of the prior prob- 
ability for the source amplitude density. This answers one of 
the common questions that arises: Why does ICA have prob- 
lems separating pure sinusoids? The answer is clear; sinu- 
soids have bimodal amplitude histograms, which is a severe 
deviation from our prior expectation of the super-Gaussian 
prior probability that we have assigned explicitly, and Bell & 
Sejnowksi assigned implicitly [4], Modifications to this prior 
to allow for sub-Gaussian densities typically do not improve 
the situation mainly because they are essentially smoothed 
uniform densities, which are non-informative. If you want to 
separate sinusoids, you need to include this relevant informa- 
tion in the design of the algorithm. 

Second, why does ICA assume the same number of 
sources as detectors? In this derivation one can see that the 
integral is not analytically solvable if we do not make such an 
assumption. In addition, if we would have assumed that the 
recorded signals were noisy and assigned a Gaussian like- 
lihood, we would again not be able to perform the integra- 
tion analytically past the first integral. The noise-free square 
mixing matrix allows for an analytic marginalization over 
the source waveshapes resulting in a straightforward and ele- 
gant solution. However, this elegant solution will break when 
pushed too far. 

Last, another common question that arises is: Are 

Gaussian-distributed signals separable? Often the answer is 
‘yes’ — you just have to rely on additional or different prior 
information. This is why understanding the information that 
goes into the design of an algorithm is so important. It al- 
lows you to better understand the range of applicability of 
an algorithm and how to fix it when it doesn’t work. This is 
why I prefer the Bayesian approach to source separation. It 
requires you to make all of this explicit. 


4. INCORPORATING PRIOR KNOWLEDGE 


In this section I will demonstrate another advantage to the 
Bayesian approach. We will modify the algorithm to account 
for a simple piece of prior information. Let’s say that we 
know that the speeds of propagation of the signals remain 
instantaneous, but that the signals follow an inverse-square 
propagation law. Such knowledge implies that the coeffi- 
cients of the mixing matrix are dependent on the relative po- 
sitions of the sources and detectors. How can we use this 
information if we have no knowledge of the source-detector 
distances? 

First, if we did know the distance from source j to 
detector i, the mixing matrix element A\j would follow the 
inverse-square propagation law 


Aij = 


1 

4nrfj 


(24) 


where V is the spherical volume of radius R surrounding the 
detector. This is the prior probability that the source is at any 
position with respect to the detector. However, we only need 
the probability that the source is some distance r from the 
detector. We obtain this by marginalizing over all possible 
values of the angular coordinates 

r 2n rK 

p(r\I) = d(j) sinOdO r 2 p(r, 0,0|/) (26) 

Jo Jo 

p2tz r7l ^ 

= /o ^ 1 Sin6d6r2 4^ (2?) 

_ 3 r 2 

~ W' 

The prior on the source-detector distance is very reassur- 
ing since it is naturally invariant with respect to coordinate 
rescaling (change of variables). Specifically if we introduce 
a new coordinate system so that p = ar and P = ciR with 
a > 0, we find equating the probabilities around p + dp and 
r + dr that 


\p(p\I)dp\ = \p(r\I)dr\ 
P(P\I) \dp I =p(r\I) \dr\ 


P(P\I) = P( r \I) 
3 r 2 


I dP 

1 dr 


P(P\I) = 
P(P\I) = 


R 3 1 

3 ap 2 


dp | 
dr 1 


3 p 2 

p(p \l) = yr- 


(28) 


Since this prior is invariant with respect to coordinate rescal- 
ing, we can measure distances using any units we wish. 

We can now use this to derive a prior for the mixing ma- 
trix element. First write the joint probability using the prod- 
uct rule 

p{A ib r\I) = p(r\I)p(Ajj\r,I). (29) 

The first term on the right is the source-detector distance 
prior, and the second term is a delta function described by 
the hard constraint of the inverse-square law. 5 These assign- 
ments give 


p(Aij,r\I) = ^3- 8(Ajj — (Anr 1 ) 1 ). (30) 

We now marginalize over all possible values of r 

P(Aij\I) = £ dr £ 5 (Ay - (47D- 2 )- 1 ). (31) 


However, we may know that the source must be within some 
maximum distance R from the detector. If it could be any- 
where in the three-dimensional space surrounding the de- 
tector, we can assign a uniform probability for its position 
(r, 0, (j)) within any volume element of that space 

p(r,9,,f>|;) = I = -L (25) 


5 Some readers may wonder why I go through the difficulty of using delta 
functions rather than computing the Jacobians and just performing a change 
of variables with the probability densities as we did before when demon- 
strating invariance with respect to rescaling. The reason is that in more 
complex problems where the parameter of interest depends on multiple other 
parameters, the change of variables technique becomes extremely difficult. 
Care must be taken when using delta functions, however, since the argument 
needs to be written so that it is solved for the parameter of interest, in this 
case Ajj rather than another parameter such as r. 



To do this we will need to make a change of variables again 
by defining u = [Ajj — (4nr 2 )~ l ), so that 

f 2 = [(4tt) (A,y — w)] -1 (32) 

r 3 = [(4 n)(Aij-u)}- 3 / 2 (33) 

and du = (2nr i )~ l dr, which can be rewritten as 

dr = 2~ 3 ^ 2 (2 n)~ { l 2 (Aij — u)~ 2 / 2 du (34) 

giving us 

p(Aij\I) = 2- 4 K~ 3/ --j^ J du (Aij-uy 5/ ~8(u), (35) 

where u max = Ajj — (4 nR 2 )~ l . The delta function will select 
u = 0 as long as it is true that u max > 0 or equivalently that 
r < R. If this is not true, the integral will be zero. If this 
hard constraint of the sources being within a distance R of 
the detectors causes problems, your choice for R was wrong. 
The result is 

(36) 

so that the prior for the mixing matrix elements is propor- 
tional to A . . . Readers more familiar with statistics may 

note (and perhaps worry) that this prior is improper since it 
blows up as Ajj goes to infinity. This is not a practical con- 
cern as long as the sources are not allowed to got off to in- 
finity. Other readers may note that this prior depends explic- 
itly on the value of the maximum source-detector distance 
R. More accurate knowledge about the value of R will lead 
naturally to a more appropriate prior probability, resulting in 
more accurate source separation results. Once the value of R 
has been chosen, this prior can be inserted into (17) to gener- 
ate a source separation algorithm that accounts for this prior 
knowledge. 

As one can imagine, these algorithms can be made ar- 
bitrarily more detailed depending on the prior information 
available. If one has information about the absolute positions 
of the detectors, such as in a sensor web, and probable loca- 
tions of the sources, one can derive more accurate prior prob- 
abilities for the source-detector distances [5] 6 . This leads 
naturally to more accurate prior probabilities for the mixing 
matrix elements, which in turn lead to better results. 

In an ICA-style gradient ascent learning rule (17, 18) 
these mixing matrix priors act as an additive term biasing 
the update rule toward the solutions suggested by the prior 
knowledge. From this perspective, the prior can be viewed 
as a regularizer. However, one shouldn’t get too carried away 
with this viewpoint since rather than being devised in an ad 
hoc manner, these priors can be carefully designed based on 
the specific prior information possessed by the algorithm de- 
signer. It would have been very difficult to guess the prior we 
have just derived above. 


6 The author would like to thank Vivek Nigam for pointing out errors in 
the SPIE98 paper, which will be corrected in a future version available at 
http://www.arxiv.org/abs/physics/0205069 


5. SEPARATION AND LOCALIZATION 

Now that we have introduced the idea of the relative posi- 
tions of the sources and detectors, we can take this problem 
even further and attempt not only to separate the sources, but 
also to localize them. Here I present results from an earlier 
paper where we considered the relationship between source 
separation and source localization [6]. 

We consider the problem of neural source estimation in 
electroencephalography (EEG) where we have multiple neu- 
ral sources in the brain and multiple recording electrodes. 
Each source j emitting a signal ,v ; will have some three- 
dimensional position in the brain p ; . In addition, these 
sources are such that they often emit dipolar current fields, 
so we must also be concerned about their orientation q 7 . The 
mixing matrix A again describes the coupling between the 
sources and the detectors. In the case of electrophysiology, 
often much is known about this coupling since the electrody- 
namics of current flow through tissue is well-understood and 
can be modelled in detail using magnetic resonance imaging 
(MRI) derived head models. In this problem, the electric cur- 
rents propagate nearly instantaneously throughout the head 
and superimpose linearly resulting in signals x recorded by 
the detectors. Using Bayes’ Theorem, we can write the pos- 
terior symbolically as 

p(p,q,A,s|x,/) ocp(p,q,A,s|7)p(x|p,q,A,s,7), (37) 

where p represents the positions of all the sources in the 
model, and similarly for the other non-subscripted parame- 
ters. The model we have chosen is redundant in the sense 
that the mixing matrix depends on the relative positions and 
orientations of the sources to the detectors. This allows us to 
simplify some of the terms above, and factor the prior using 
the product rule 

p(p,q,A,s|x,7) °cp( x |A,s,/)x 

p(A|p,q,/)p(pj/)p(q|p,/)p(s|/). (38) 

These priors show that some model parameters are dependent 
on others, such as the prior for the mixing matrix. The ori- 
entations of the sources depend on their position in the brain 
since the orientations are determined by the histology of that 
particular neural source. 

Now if we assume that we know nothing about the 
source positions, nor how they affect the mixing matrix, the 
prior /u A|p. q, 7) reduces to p(A\I). Marginalizing over all 
source positions and orientations using uniform priors, we 
recover the basic source separation problem (4) 

p(A,s|x,7) °c 7>(x|A,s,7)p(A|7)p(s|7). (39) 

With the appropriate probability assignments, we could re- 
cover the Infomax ICA algorithm, or perhaps another source 
separation algorithm that is better suited for the job. 

However, let’s see what happens if we change our focus 
and concentrate on the source positions rather than the mix- 
ing matrix. We will describe the propagation of the signals 
from the sources to the detectors with a forward model that 
takes into account the electrodynamics of the physical situ- 
ation. For simplicity, we will write this symbolically as a 
function 

Ajj = F(d i ,p j ,q j ), 


(40) 



where d, is the position of the i th detector, p ; and q ; are the 
position and orientation of the j th source. This function will 
play a role in our prior probability for A t j 

M N 

f*(A|p,q,/) = | _ [| _ [5(A,' J -T’(d,',p / ,q / )), (41) 

i=ij=\ 


The remaining priors provide the potential for the intro- 
duction of a significant amount of prior information. In this 
demonstration however, I will simply assign uniform priors. 
Taking the logarithm of the posterior results in 

M T (y r- t 2 

logp(p,q,s|x,7) = + C > ( 4? ) 

i=\t= 1 ZO i 


where we are assuming that there are N sources and M de- 
tectors. 

Next we will assign a Gaussian likelihood to encode that 
we have uncertainties about how well the model describes 
the experimental data. The idea is to assume we know the 
expected squared deviation <7 2 between the predicted and ob- 
served results. Given that we know this expected squared de- 
viation, the principle of maximum entropy [7] says that the 
Gaussian distribution is the most honest quantification of this 
knowledge. 


M T 

p( x |A,s,/) = | _ [| _ [exp 


i= 1 t=l 


2(7? 


(42) 


It is important to note that this does not imply that we believe 
the noise is Gaussian distributed, it merely implies that we 
know something about the expected squared deviation. 7 If in 
fact we do not know the actual value of <7, we can always 
marginalize over it to obtain a more conservative probability 
density related to Student’s t distribution. 

We now marginalize the posterior to get rid of the nui- 
sance parameters An 


p(p,q,s|x,7) °c p(p|7)p(q|p,7)p(s|7) 

jdA p(x|A,s,7)p(A|p,q,7). (43) 

With our probability assignments (42) and (41), and writing 
Fij = f (d,. p ; . q ; ) the integrals become 
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where C is the logarithm of the implicit proportionality con- 
stant. Maximizing this log posterior results in minimizing 
the familiar chi-squared ‘cost function’ 


r = EE?4j?Uc, 

i=lt= 1 


2 a} 


(48) 


which is a common procedure in electromagnetic source lo- 
calization. 

From this example we see that based on the parameters of 
interest and the prior information we include, a source sep- 
aration problem can become a source localization problem. 
The lesson here is that the Bayesian formalism is the struc- 
ture that underlies not only source separation and source lo- 
calization problems, but rather signal processing in general. 
In fact, many familiar techniques — even the Fourier trans- 
form [8] — have their basis in the Bayesian methodology and 
can be significantly improved by understanding the underly- 
ing models and assumptions that go into the algorithm. 


6. BEYOND SEPARATION 

We can take these ideas further by developing signal mod- 
els that include parameters that allow us to characterize or 
describe the signals in different ways. In this example, we 
again consider EEG signals. Typically an experimenter will 
design an experiment and record data from multiple exper- 
imental trials. The standard analysis technique consists of 
averaging the data across trials to reduce the effects of noise 
(‘noise’ meaning signals that are either not understood or not 
interesting). The implicit signal model that is employed is 
the signal plus noise (SPN) model where we assume that we 
have a single stereotypic source waveshape s(t) that is pro- 
duced every trial in addition to ongoing noise 1 ] (t). 9 The data 
that is recorded in an electrode can be modelled as 


x r {t) = s(f) + ? 7 ,-(f), (49) 


which gives 


nn-p ^2 — 

r=lf=l L ztT i J 


where r indexes one of the R trials and t indexes the measure- 
ments at T discrete time points. 10 Using Bayes’ Theorem 
we have, 

p(s|x,7) ocp( x |s,7)p(s|7). (50) 


Simplifying the notation further by writing x u = YULi Fil s lt > 8 
the marginalized posterior is then 


Relying on arguments laid out in the previous section, I will 
assign a Gaussian likelihood, and a uniform prior for ,v. The 
log posterior is then 


P(p,q,s|x,7) °cp(p|7)p(q|p,7)p(s|7)x 
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7 Jaynes has an excellent chapter where he works through this common 
misconception [7] ch. 7, specifically 7.7. 

8 Note that x are the predicted recordings based on the sources s and the 
forward model. 


l°gp(s|x,7) 


R T 


EE 


Q>(f)-s(f )) 2 

2(7 2 


+ C. 


(51) 


9 I have changed notation here slightly where I am now writing these 
signals as functions of time. This notation is more clear later when we are 
required to describe latency shifts of the neural response. 

10 Note that the Bayesian methodology does not require that these mea- 
surements be equally spaced in time. This is a distinct advantage when 
dealing with ‘missing data’ problems. 



We can find the maximum of the log posterior by taking the 
derivative with respect to each s(t) and setting it equal to 
zero. For a particular time t' we have 


d d ( JL (x r (t) - s(t)) 2 

wi 8 ' ,(s|x - ’ = wi v SS “ ^ — +c 

= -o- 2 Y J (x r (t')-s{t')). (52) 


Setting this equal to zero and solving for s(t') we get 
■*(0 = 


(53) 


which shows that if you believe that there is only one stereo- 
typic signal in the data then averaging the data over trials will 
yield the optimal estimator of the source signal. 

However, researchers are well aware that there are mul- 
tiple simultaneous signals, and that these signals vary from 
trial-to-trial. We have shown that both amplitude and latency 
variability play a role in the variations of the signals emitted 
by neural sources [9], This has led us to a new model of the 
recorded signal from a set of neural sources 

N 

•^r(0 = ^Xfi r S n {t T m -) + T]r(0> (54) 

n= 1 

where a m describes the amplitude scale of the n th component 
during the r ,h trial, and x nr similarly describes its latency shift 
forward or backward in time. This allows us to account for 
and to characterize amplitude changes and response delays 
in the neural responses during the course of an experiment or 
under different experimental conditions. This model assumes 
that each of the N sources has a distinct stereotypic wave- 
shape. In our work we have found that by simply describing 
these additional characteristics of the neural responses, we 
can separate source signals that vary differentially from trial 
to trial. The algorithm that results from this model, and our 
subsequent probability assignments, is called differentially 
Variable Component Analysis (dVCA) [10, 11, 12], To ac- 
commodate multiple detectors, we simply modify the signal 
model accordingly 


N 

Xmr(t) — C mn a nr s n (t x nr ) T tj mr (tj . (55) 

n = 1 

where C is the mixing matrix, or coupling matrix as we call 
it since it describes the coupling between the sources and the 
detectors. With this new signal model in hand, we are already 
making interesting new discoveries in our old data sets. 

7. CONCLUSION 

In this tutorial I have introduced the idea of informed source 
separation. My motivations here are those of a physical sci- 
entist, where I have specific problems in need of accurate 
solutions. In these cases, it is much more advantageous to 
begin with the appropriate model, introduce the known prior 


information, and derive an algorithm specifically engineered 
for the task. 

Historically, while source separation had its beginnings 
in neural networks and information theory [13, 14, 15, 1], 
it was recognized early on that these results were related 
to the maximum likelihood formalism [16, 17, 18]. From 
this point, one is easily led to the Bayesian methodology 
[4, 19, 2, 20, 21], A distinct advantage of the Bayesian ap- 
proach is that it breaks the problem into three pieces: the sig- 
nal model, the cost function, and the search algorithm. The 
researcher begins by choosing an appropriate signal model 
for the physical problem. Once this model has been chosen, 
the researcher uses probability theory to derive the posterior 
probability, which is the cost function to be optimized. With 
a cost function in hand, a search algorithm is employed to 
identify the optimal model parameter values. Each of these 
three pieces can be modified leading to different algorithms 
that vary in applicability, accuracy and efficiency. 

Missing from this short tutorial is a discussion of the nu- 
merous techniques and algorithms that can be used to search 
the parameter space to identify solutions with high probabili- 
ties. I will attempt to refer the reader to a variety of useful and 
important techniques that have been presented in the litera- 
ture. These methods include: gradient ascent search [1, 22], 
iterative fixed point algorithms [23, 10], Markov chain Monte 
Carlo (MCMC) [24, 25], sequential MCMC (also known 
as particle filters) [26, 27], mean field and ensemble meth- 
ods [28, 29, 30, 31], variational Bayes [32, 33], as well as 
Bayesian techniques which utilize sparsity [25]. Last, as an 
aid to better understanding Bayesian methods, I would rec- 
ommend the following introductory references [34, 35, 36]. 

I would also recommend that the reader seek out the other 
papers presented in this special session to get a taste for the 
wide array of methods and applications. My hopes are that 
this tutorial will inspire and enable readers to engineer algo- 
rithms for their specific problems. 
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